Bucketing

Basically, for decomposing table data sets into more manageable parts, Apache Hive offers a technique beyond partitioning. That technique is what we call Bucketing in Hive.
  
Why Bucketing?
Basically, the concept of Hive Partitioning provides a way of segregating Hive table data into multiple files/directories. However, it gives effective results in only a few scenarios, such as:
  • when there is a limited number of partitions, and
  • when the partitions are of comparatively equal size.
For example, suppose we partition our tables based on geographic location, such as country. Some bigger countries will produce large partitions (4-5 countries by themselves may contribute 70-80% of the total data), while the data of small countries will create small partitions (all the remaining countries in the world may contribute just 20-30% of the total data). In such cases, partitioning is not ideal. To solve this problem of over-partitioning, Hive offers the Bucketing concept, another effective technique for decomposing table data sets into more manageable parts.


Bucketing is another technique for decomposing data sets into more manageable parts. For example, suppose a table that uses date as the top-level partition and employee_id as the second-level partition produces too many small partitions. Instead, if we bucket the employee table and use employee_id as the bucketing column, the values of this column will be hashed into a user-defined number of buckets. Records with the same employee_id will always be stored in the same bucket. Assuming the number of distinct employee_id values is much greater than the number of buckets, each bucket will contain many employee_id values. The number of buckets is fixed, so it does not fluctuate with the data. If two tables are bucketed by employee_id, Hive can create a logically correct sampling. Bucketing also aids in doing efficient map-side joins.

Now, let's assume there is a huge dataset. At times, even after partitioning on a particular field or fields, the partitioned files do not match the actual expectation and remain huge, and we want to divide the partition results into further parts. To overcome this limitation of partitioning, Hive provides the Bucketing concept, which allows the user to divide table data sets into more manageable parts. Thus, Bucketing helps the user maintain parts that are more manageable, and the user can control the size of these parts (Buckets) by choosing how many to create.

How does Hive distribute rows into buckets?
Hive determines the bucket number for a row using the formula: hash_function(bucketing_column) modulo (num_of_buckets). Here, hash_function depends on the column data type. For example, if you bucket the table on a column of INT data type, say user_id, the hash_function is simply hash_function(user_id) = integer value of user_id. Suppose you have created two buckets; Hive then determines the rows going into each bucket in each partition by calculating (value of user_id) modulo 2. Therefore, rows whose user_id ends in an even digit will all reside in the same bucket within each partition. The hash_function for other data types is more complex to calculate, and for a string it is not even humanly recognizable.
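As a quick worked example, assume a table bucketed INTO 2 BUCKETS on the INT column user_id (buckets numbered 0 and 1):

user_id = 101  ->  101 mod 2 = 1  ->  bucket 1
user_id = 102  ->  102 mod 2 = 0  ->  bucket 0
user_id = 205  ->  205 mod 2 = 1  ->  bucket 1

So all even user_id values land in one bucket file, and all odd values in the other.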

Every bucket creates one file in HDFS, and HDFS stores every file in blocks of 128 MB (by default). Generally, Hive creates one file per bucket, and one reducer writes each bucket. If you create many buckets and every bucket is smaller than 128 MB, it will hurt performance, so the bucket size should be equal to or bigger than the block size (128 MB). For example, roughly 10 GB of data at a 128 MB block size supports at most about 80 buckets (10,240 MB / 128 MB).


Bucketing Features in Hive
  • The Bucketing concept is based on (hash function on the bucketed column) mod (total number of buckets). The hash_function depends on the type of the bucketing column.
  • Records with the same bucketed-column value will always be stored in the same bucket.
  • We use the CLUSTERED BY clause to divide the table into buckets.
  • Physically, each bucket is just a file in the table directory.
  • Bucketing can be done along with partitioning on Hive tables, or even without partitioning.
  • Bucketed tables create almost equally distributed data file parts.
Steps Involved in Creating Bucketing
  • Create a temporary table.
  • Load data into the temporary table.
  • Create the bucketed table.
  • Enable bucketing.
  • Insert data into the bucketed table from the temporary table.
  • Drop the temporary table.
Temp table in Hive
CREATE TABLE  customers (
   customer_id  int,
   customer_fname  string,
   customer_lname  string,
   customer_email  string,
   customer_password  string,
   customer_street  string,
   customer_zipcode  string,
   customer_state string,
   customer_city string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'warehouse/customers_part';
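
The steps above call for loading data into this temporary table first. A minimal sketch, assuming the source is a comma-delimited file at the hypothetical HDFS path '/user/hive/input/customers.csv':

LOAD DATA INPATH '/user/hive/input/customers.csv' INTO TABLE customers;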

Bucketed table in Hive
CREATE TABLE  customers_bucket (
   customer_id  int,
   customer_fname  string,
   customer_lname  string,
   customer_email  string,
   customer_password  string,
   customer_street  string,
   customer_zipcode  string,
   customer_state string,
   customer_city string
) CLUSTERED BY (customer_state) INTO 4 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'warehouse/customers_bucket';

Enabling Bucketing
Before loading, enable bucketing so that Hive creates the right number of bucket files (in Hive 2.x and later this property was removed because bucketing is always enforced):
set hive.enforce.bucketing = true;

Load data into Bucketed Table
insert into customers_bucket select * from customers;
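
To verify the layout, list the table directory from the Hive shell; with 4 buckets you should see four data files, one per bucket. A sketch, assuming the relative LOCATION used above:

dfs -ls warehouse/customers_bucket;
-- expect four bucket files, e.g. 000000_0, 000001_0, 000002_0, 000003_0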

Select records from the table
select * from customers_bucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON customer_state);
select * from customers_bucket TABLESAMPLE(10 PERCENT);

CREATE TABLE dwdev.Emp_Bucket (
   empno   int,
   ename   varchar(20),
   job     varchar(20),
   sal     float,
   comm    float,
   deptno  int
) PARTITIONED BY (loc VARCHAR(64))
  CLUSTERED BY (deptno) INTO 4 BUCKETS
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n' STORED AS TEXTFILE
  LOCATION '/user/cloudera/data/hive_data/';

insert overwrite table Emp_Bucket partition(loc) select * from emp;
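
Because loc is loaded as a dynamic partition here, Hive will reject the insert above unless dynamic partitioning is enabled first. A minimal sketch of the session settings to run beforehand (the emp source table is assumed to already exist with matching columns, with loc last):

set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.enforce.bucketing = true;   -- pre-2.x versions only; later versions always enforce bucketing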

select * from Emp_Bucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON deptno);
select * from Emp_Bucket TABLESAMPLE(BUCKET 2 OUT OF 4 ON deptno);
select * from Emp_Bucket TABLESAMPLE(10 PERCENT);

select * from dv_pdpchmoph_nogbd_r000_wh.customers_bucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON customer_state)
where customer_fname = 'Willim';

Advantages   
Fast Map-side Joins – If two tables are bucketed by the same column(s) into the same number of buckets, and the join is performed on the bucketed column(s), then Hive can do an efficient map-side join by reading the same bucket from both tables and joining them, since all the data for matching records is present in the corresponding bucket of each table. If the records are sorted inside each bucket, Hive can join the data with a merge, which is a linear-time operation. Bucketing helps only when the join key and the bucketing key are the same.

Efficient Group By – If the GROUP BY is performed on the bucketed column(s), aggregations can be partially performed in the combiner. This reduces network traffic by sending less data to the reducers.
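
For instance, a small sketch of such an aggregation on the bucketing column of the Emp_Bucket table defined above:

select deptno, count(*) as emp_cnt, avg(sal) as avg_sal
from Emp_Bucket
group by deptno;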

Sampling – Using bucketing, we can run queries on a sample of data from the table. This is beneficial while testing, so that we need not run our queries on the whole data set.

How to decide the number of buckets in a Hive table?

buckets = (x * average_partition_size) / JVM_memory_available_to_your_Hadoop_tasknode

Here, x is a tuning parameter. If the JVM memory is smaller, you have to create more buckets; likewise, if your partition size is bigger, you also have to create more buckets.
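
For illustration only: suppose x = 2, the average partition size is 4 GB, and each Hadoop task node has 1 GB of JVM memory. Then buckets = (2 * 4096 MB) / 1024 MB = 8 buckets. Round to a convenient number, keeping each bucket at or above the 128 MB block size discussed earlier.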

How does bucketing help in the faster execution of queries?
If you have to join two large tables, you can go for a reduce-side join. But if both tables have the same number of buckets (or multiples of the same number) and are also sorted on the same column, a Sort-Merge-Bucket Map Join (SMBMJ) becomes possible, in which the entire join takes place in the map phase itself by matching the corresponding buckets.

Buckets are basically files created inside the table's HDFS directory.

There are several properties you need to set for bucket map joins, and they are as follows:

    set hive.enforce.sortmergebucketmapjoin = false;
    set hive.auto.convert.sortmerge.join = false;
    set hive.optimize.bucketmapjoin = true;
    set hive.optimize.bucketmapjoin.sortedmerge = true;
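
With these properties set, a join on the bucketing column can be executed bucket by bucket on the map side. A minimal sketch, assuming a second hypothetical table dept_bucket that is bucketed (and, for the sort-merge variant, sorted) INTO 4 BUCKETS on deptno just like Emp_Bucket, with a hypothetical dname column; note that the Emp_Bucket DDL above omits SORTED BY, which the sort-merge path would require:

select /*+ MAPJOIN(d) */ e.empno, e.ename, d.dname
from Emp_Bucket e
join dept_bucket d
  on e.deptno = d.deptno;

The MAPJOIN hint is how older Hive versions were told to try the bucketed map-side path; newer versions can convert eligible joins automatically when hive.auto.convert.sortmerge.join is set to true.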

What are the factors to be considered while deciding the number of buckets?
One factor is the block size itself, since each bucket is a separate file in HDFS; the file size should be at least the same as the block size. The other factor is the volume of data. In fact, these two factors go together.

At the time of table creation, the data volume may not be known. In that case, what is the recommended number of buckets? And as more data gets ingested, the number of buckets chosen at table-creation time may have to be increased. Is it possible to increase the number of buckets periodically so that the data stays balanced?
